Iterative normalization for machine learning applications

ABSTRACT

An embodiment of a semiconductor package apparatus may include technology to process one or more vectors with a sum of squares operation with a layer of a multi-layer neural network, and determine a fixed-point approximation for the sum of squares operation. Other embodiments are disclosed and claimed.

TECHNICAL FIELD

Embodiments generally relate to machine learning systems. More particularly, embodiments relate to iterative normalization for machine learning applications.

BACKGROUND

Multi-layer neural network technology has many applications, including machine learning applications. Examples of machine learning applications include CAFFE, THEANO, APACHE SPARK, and MICROSOFT AZURE, all of which may utilize multi-layer neural network technology. Some multi-layer neural networks may include batch normalization technology.

BRIEF DESCRIPTION OF THE DRAWINGS

The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:

FIG. 1 is a block diagram of an example of an electronic processing system according to an embodiment;

FIG. 2 is a block diagram of an example of a semiconductor package apparatus according to an embodiment;

FIG. 3 is a flowchart of an example of a method of machine learning according to an embodiment;

FIG. 4 is a block diagram of an example of a multi-layer neural network apparatus according to an embodiment;

FIG. 5 is a flowchart of an example of a method of determining a vector norm according to an embodiment;

FIGS. 6A and 6B are block diagrams of examples of machine learning apparatuses according to embodiments;

FIG. 7 is a block diagram of an example of a processor according to an embodiment; and

FIG. 8 is a block diagram of an example of a system according to an embodiment.

DESCRIPTION OF EMBODIMENTS

Turning now to FIG. 1, an embodiment of a multi-layer neural network apparatus 10 may include a first computational layer 11, and a second computational layer 12 communicatively coupled to the first computational layer 11. One or more of the first and second computational layers 11, 12 may include logic 13 to process one or more vectors with a sum of squares operation, and determine a fixed-point approximation for the sum of squares operation. For example, the logic 13 may be further configured to provide overflow protection for the sum of squares operation. In some embodiments, the logic 13 may be further configured to provide batch normalization for the one or more vectors. For example, the logic 13 may be configured to accumulate a running value corresponding to a square root of the sum of squares operation. Some embodiments of the logic 13 may be further configured to determine a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value, and accumulate the running value corresponding to a square root of the sum of squares operation based on the determined number of elements. In some embodiments, the logic 13 may include further technology to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, a machine learning application, etc. In some embodiments, the computational layers 11, 12 and/or the logic 13 may be located in, or co-located with, various components, including a processor, memory, an inference engine, etc. (e.g., on a same die).

Embodiments of each of the above computational layers 11, 12, logic 13, and other components of the apparatus 10 may be implemented in hardware, software, or any suitable combination thereof. For example, hardware implementations may include configurable logic such as, for example, programmable logic arrays (PLAs), field programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), or fixed-functionality logic hardware using circuit technology such as, for example, application specific integrated circuit (ASIC), complementary metal oxide semiconductor (CMOS) or transistor-transistor logic (TTL) technology, or any combination thereof.

Alternatively, or additionally, all or portions of these components may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as random access memory (RAM), read only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc., to be executed by a processor or computing device. For example, computer program code to carry out the operations of the components may be written in any combination of one or more operating system (OS) applicable/appropriate programming languages, including an object-oriented programming language such as PYTHON, PERL, JAVA, SMALLTALK, C++, C# or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. For example, persistent storage media, or other memory may store a set of instructions which when executed by a processor cause the apparatus 10 to implement one or more components, features, or aspects of the apparatus 10 (e.g., the computational layers 11, 12, the logic 13, processing the vectors with the sum of squares operation, determining the fixed-point approximation for the sum of squares operation, etc.). Embodiments of a suitable processor may include a general purpose processor, a special purpose processor, a central processing unit (CPU), a controller, a micro-controller, a kernel, an execution unit, etc.

Turning now to FIG. 2, an embodiment of a semiconductor package apparatus 20 may include one or more substrates 21, and logic 22 coupled to the one or more substrates 21, wherein the logic 22 is at least partly implemented in one or more of configurable logic and fixed-functionality hardware logic. The logic 22 coupled to the one or more substrates 21 may be configured to process one or more vectors with a sum of squares operation with a layer of a multi-layer neural network, and determine a fixed-point approximation for the sum of squares operation. For example, the logic 22 may be further configured to provide overflow protection for the sum of squares operation. In some embodiments, the logic 22 may be further configured to provide batch normalization for the one or more vectors. For example, the logic 22 may be configured to accumulate a running value corresponding to a square root of the sum of squares operation. Some embodiments of the logic 22 may be further configured to determine a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value, and accumulate the running value corresponding to a square root of the sum of squares operation based on the determined number of elements. In some embodiments, the logic 22 may include further technology to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, a machine learning application, etc. In some embodiments, the logic 22 coupled to the one or more substrates 21 may include transistor channel regions that are positioned within the one or more substrates 21.

Embodiments of logic 22, and other components of the apparatus 20, may be implemented in hardware, software, or any combination thereof including at least a partial implementation in hardware. For example, hardware implementations may include configurable logic such as, for example, PLAs, FPGAs, CPLDs, or fixed-functionality logic hardware using circuit technology such as, for example, ASIC, CMOS, or TTL technology, or any combination thereof. Additionally, portions of these components may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., to be executed by a processor or computing device. For example, computer program code to carry out the operations of the components may be written in any combination of one or more OS applicable/appropriate programming languages, including an object-oriented programming language such as PYTHON, PERL, JAVA, SMALLTALK, C++, C# or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.

The apparatus 20 may implement one or more aspects of the method 30 (FIG. 3), or any of the embodiments discussed herein. In some embodiments, the illustrated apparatus 20 may include the one or more substrates 21 (e.g., silicon, sapphire, gallium arsenide) and the logic 22 (e.g., transistor array and other integrated circuit/IC components) coupled to the substrate(s) 21. The logic 22 may be implemented at least partly in configurable logic or fixed-functionality logic hardware. In one example, the logic 22 may include transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 21. Thus, the interface between the logic 22 and the substrate(s) 21 may not be an abrupt junction. The logic 22 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 21.

Turning now to FIG. 3, an embodiment of a method 30 of machine learning may include processing one or more vectors with a sum of squares operation with a layer of a multi-layer neural network at block 31, and determining a fixed-point approximation for the sum of squares operation at block 32. For example, the method 30 may further include providing overflow protection for the sum of squares operation at block 33. Some embodiments of the method 30 may further include providing batch normalization for the one or more vectors at block 34. The method 30 may also include accumulating a running value corresponding to a square root of the sum of squares operation at block 35. For example, the method 30 may include determining a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value at block 36, and accumulating the running value corresponding to a square root of the sum of squares operation based on the determined number of elements at block 37. In some embodiments of the method 30, the multi-layer neural network may include further technology to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, and a machine learning application at block 38.

Embodiments of the method 30 may be implemented in a system, apparatus, computer, device, etc., for example, such as those described herein. More particularly, hardware implementations of the method 30 may include configurable logic such as, for example, PLAs, FPGAs, CPLDs, or fixed-functionality logic hardware using circuit technology such as, for example, ASIC, CMOS, or TTL technology, or any combination thereof. Alternatively, or additionally, the method 30 may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., to be executed by a processor or computing device. For example, computer program code to carry out the operations of the components may be written in any combination of one or more OS applicable/appropriate programming languages, including an object-oriented programming language such as PYTHON, PERL, JAVA, SMALLTALK, C++, C# or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.

For example, the method 30 may be implemented on a computer readable medium as described in connection with Examples 20 to 25 below. Embodiments or portions of the method 30 may be implemented in firmware, applications (e.g., through an application programming interface (API)), or driver software running on an operating system (OS). Additionally, logic instructions might include assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, state-setting data, configuration data for integrated circuitry, state information that personalizes electronic circuitry and/or other structural components that are native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).

Turning now to FIG. 4, an embodiment of a multi-layer neural network 40 may include two or more layers 41, where at least one of the layers includes complex computations implemented with fixed point approximations of the complex computations. In a conventional neural network, such complex computations may either be performed with precision circuits (e.g., floating point arithmetic logic units, higher precision integer math/logic) or may overflow the fixed point logic circuits. Precision circuits may involve substantial additional circuitry and/or execution time (e.g., additional power and/or compute resources). Overflowing the fixed point logic may cause a fault in the operation. Advantageously, some embodiments may replace a complex computation with an appropriate approximation that can successfully be performed on the fixed point (e.g., integer) arithmetic logic or lower precision logic without overflowing. Utilizing lower precision fixed point arithmetic may consume fewer compute resources (e.g., power, silicon area, etc.) as compared to higher precision units, and may allow deployment of a neural network in a less powerful computer system (e.g., a mobile device such as an edge device, a tablet, a smartphone, etc.).

For example, the multi-layer neural network 40 may include one or more batch normalization layers 42. Batch normalization may involve a sum of squares operation. Some embodiments may advantageously provide an iterative process for batch normalization to inhibit or prevent overflows in limited precision systems. For example, some embodiments may utilize an approximation to prevent overflows of the sum of squares during a batch normalization operation in limited precision systems. Batch normalization in neural networks may involve the computation of the magnitude of vectors (e.g., the vector norm or “L2-norm” of vectors). In limited precision arithmetic, such a sum of squares operation could overflow beyond the numbers that can be represented in the system. Some embodiments may instead utilize an iterative process to calculate a running value for the square root of the sum of squares operation (e.g., the L2-norm computation that is used in batch normalization). Advantageously, some embodiments may avoid calculating large values (e.g., sums of squares) that may lead to overflow. In some embodiments, the approximation may limit the largest sum of squares values, which may advantageously help prevent overflows in batch normalization calculations for limited precision arithmetic. Some embodiments may advantageously utilize fixed point or flex point arithmetic for batch normalization (e.g., instead of floating point), which may provide power and/or performance advantages.

Embodiments of the layer(s) 41, the batch normalization layer(s) 42, and other components of the multi-layer neural network 40, may be implemented in hardware, software, or any combination thereof including at least a partial implementation in hardware. For example, hardware implementations may include configurable logic such as, for example, PLAs, FPGAs, CPLDs, or fixed-functionality logic hardware using circuit technology such as, for example, ASIC, CMOS, or TTL technology, or any combination thereof. Additionally, portions of these components may be implemented in one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc., to be executed by a processor or computing device. For example, computer program code to carry out the operations of the components may be written in any combination of one or more OS applicable/appropriate programming languages, including an object-oriented programming language such as PYTHON, PERL, JAVA, SMALLTALK, C++, C# or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.

Batch normalization in neural networks may involve the computation of an L2-norm of vectors. In limited precision arithmetic, sum of squares operations could overflow beyond the numbers that can be represented in the system. If the vector x is represented as:

$x = [x_1, x_2, \ldots, x_N]$  [Eq. 1]

A straightforward way of calculating the L2-norm for this vector may be represented as:

$\|x\|_2 = y_{(1,N)} = \sqrt{\sum_{i=1}^{N} x_i^2}$  [Eq. 2]

where y_(1,N) denotes the square root of the sum of squares for elements 1 through N. For large values of N, adding N of the squared x_i elements could lead to overflow. For some applications, ultimately the square root of this sum may be needed, which is much smaller than the sum. Some embodiments may advantageously determine the small number (e.g., the square root) without having to calculate the larger sum of squares.
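
To make the overflow concern concrete, the following illustrative Python sketch (not part of any embodiment; the 16-bit limit and the example values are arbitrary assumptions) shows a case where the intermediate sum of squares exceeds the largest representable value even though the final square root fits comfortably:

    # Illustrative sketch only: arbitrary values, a signed 16-bit fixed-point format assumed.
    import math

    A_MAX = 2**15 - 1               # largest positive value in the assumed format
    x = [200] * 100                 # each element is well within range

    sum_sq = sum(v * v for v in x)  # 100 * 40,000 = 4,000,000: far beyond A_MAX
    norm = math.sqrt(sum_sq)        # 2,000: easily representable

    print(sum_sq > A_MAX)           # True: the intermediate sum overflows
    print(norm <= A_MAX)            # True: the desired result does not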

For example, some embodiments may determine an approximation for the smaller number without performing the intermediate calculations of the larger number. Any suitable approximation may be used, based on the particular application. For example, some embodiments may calculate the square root of the sum of squares of two numbers with the following approximation:

$\sqrt{a^2 + b^2} \approx a + \frac{b^2}{2a}$  [Eq. 3]

with the condition that 0 < b < 0.5a. Under this condition, the error in the approximation is at most 0.64%. This error drops to about 0.02% when b/a = 0.2.
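
As a quick sanity check of these error figures, a minimal Python sketch (values of a and b chosen arbitrarily for illustration) might look like the following:

    # Sketch only: compares Eq. 3 against the exact square root of the sum of squares.
    import math

    def approx_hypot(a, b):
        # Eq. 3, intended for the condition 0 < b < 0.5 * a.
        return a + (b * b) / (2 * a)

    for ratio in (0.5, 0.2):
        a, b = 1000.0, 1000.0 * ratio
        exact = math.sqrt(a * a + b * b)
        rel_err = abs(approx_hypot(a, b) - exact) / exact
        print(ratio, rel_err)       # roughly 0.6% at b/a = 0.5, about 0.02% at b/a = 0.2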

Turning now to FIG. 5, an embodiment of a method 50 of determining the L2-norm of the vector x with N elements may include determining a square root of the sum of squares for the first K samples of the vector x at block 51. For example, some embodiments may take the first K samples of the vector x, and calculate the square root of the sum of squares for elements 1 through K utilizing fixed point processing as follows:

$y_{(1,K)} = \sqrt{\sum_{i=1}^{K} x_i^2}$  [Eq. 4]

Some embodiments may choose a value for K which is much less than N (K«N) so that the fixed point operation is unlikely to overflow. Alternatively, some embodiments may iteratively sum the squares until the accumulated value exceeds a threshold value. For example, if the maximum positive value that can be represented in the system is A_max, some embodiments may continue summing the x_i² elements until half of the maximum positive value is reached (e.g., threshold = A_max/2). The value for K may then be set to the position of the last element successfully summed.
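
One way block 51 could be realized is sketched below in Python; the function name, the threshold placement, and the assumption that each individual squared element fits under the threshold are illustrative choices rather than requirements of the embodiments:

    # Sketch only: accumulate squares for the first K samples (block 51).
    import math

    def first_chunk_norm(x, a_max):
        # Sum x_i^2 until adding another square would exceed a_max / 2, then
        # return (K, y_(1,K)), where K counts the elements actually consumed.
        threshold = a_max // 2
        acc, k = 0, 0
        for v in x:
            if acc + v * v > threshold:
                break
            acc += v * v
            k += 1
        return k, math.sqrt(acc)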

The method 50 may then include determining a square root of the sum of squares for the next M samples of the vector x at block 52. For example, some embodiments may take the elements from K+1 through K+M of the vector x, and calculate the square root of the sum of squares for those elements utilizing fixed point processing as follows:

$y_{(K+1,K+M)} = \sqrt{\sum_{i=K+1}^{K+M} x_i^2}$  [Eq. 5]

where a value for M is selected to be small enough such that y_(K+1,K+M) is unlikely to overflow, and y_(K+1,K+M) < 0.5 y_(1,K). Alternatively, some embodiments may iteratively sum the squares until a condition is met. For example, some embodiments may continue summing M values until the condition y_(K+1,K+M) ≈ 0.2 y_(1,K) is reached.

The method 50 may then include approximating a value for the square root of the sum of squares for elements 1 through K+M at block 53, based on the values determined at blocks 51 and 52. For example, instead of calculating the value y_(1,K+M) as follows (e.g., which may involve summing two large numbers):

$y_{(1,K+M)} = \sqrt{\sum_{i=1}^{K+M} x_i^2} = \sqrt{y_{(1,K)}^2 + y_{(K+1,K+M)}^2}$  [Eq. 6]

Some embodiments may approximate the value as follows:

$y_{(1,K+M)} = \sqrt{y_{(1,K)}^2 + y_{(K+1,K+M)}^2} \approx y_{(1,K)} + \frac{y_{(K+1,K+M)}^2}{2\,y_{(1,K)}}$  [Eq. 7]

Advantageously, some embodiments may determine an accurate approximation for y_(1,K+M) while avoiding summing the squares of all numbers from 1 to K+M in one operation, which helps avoid overflows. Moreover, after calculating the two smaller sums of squares y_(1,K)² and y_(K+1,K+M)², these two sums of squares are also not added directly (e.g., as they are in Eq. 6). Instead, an appropriate approximation (see Eq. 7) is utilized to help avoid overflow.

The method 50 may then determine if all elements have been consumed at block 54. If so, the method 50 may be done at block 55. Otherwise, the method 50 may set the value for K to be equal to K+M at block 56 (e.g., K=K+M), and the method 50 may return to block 52 to consume the next M elements of the vector x. For example, the previous value for y_(1,K+M) may be iteratively utilized to determine the approximation for the next value of y_(1,K+2M), until all N elements are consumed.
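
Blocks 51 through 56 can be combined into a single routine. The Python sketch below is one illustrative reading of the method (the chunking policy, the thresholds, and the assumption of a non-empty vector whose individual squared elements fit under the threshold are choices made here, not requirements of the embodiments); the reference comparison at the end uses the direct sum only to check the result:

    # Sketch only: iterative approximation of the L2-norm per Eqs. 4-7 (method 50).
    import math

    def iterative_l2_norm(x, a_max):
        threshold = a_max // 2                 # cap on any partial sum of squares
        # Block 51: sum squares of the first K samples, stopping before the threshold.
        acc, i = 0, 0
        while i < len(x) and acc + x[i] * x[i] <= threshold:
            acc += x[i] * x[i]
            i += 1
        y = math.sqrt(acc)                     # running value y_(1,K)
        # Blocks 52-56: consume the remaining samples in small chunks.
        while i < len(x):
            acc = 0
            # Block 52: grow the chunk until its norm reaches roughly 0.2 * y
            # (the approximation is most accurate while the chunk norm stays
            # below about half of the running value).
            while (i < len(x) and acc + x[i] * x[i] <= threshold
                   and math.sqrt(acc) < 0.2 * y):
                acc += x[i] * x[i]
                i += 1
            # Block 53: fold the chunk in with Eq. 7; acc is the chunk's squared norm.
            y = y + acc / (2.0 * y)
        return y

    # Check against the direct (overflow-prone) computation on example data.
    data = [4] * 5000
    print(iterative_l2_norm(data, 2**15 - 1))   # within roughly 1% of the exact value below
    print(math.sqrt(sum(v * v for v in data)))  # about 282.84

Note that no intermediate sum of squares inside iterative_l2_norm ever exceeds the threshold, at the cost of a small accumulated approximation error from the repeated Eq. 7 updates.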

Some other systems may avoid overflow issues for complex computations in neural networks (e.g., sum of squares operations) by performing those computations in a native higher precision mode (e.g., 48-bits vs. 16-bits). For matrix multiplications, the higher precision mode may result in significantly higher power consumption. In some architectures, using the higher precision mode may cause lower performance and/or higher power consumption. Some embodiments may advantageously utilize approximations for the complex computations, which may be performed in lower precision with higher performance and/or lower power consumption. In addition, some embodiments may improve or optimize memory traffic to retrieve elements of the vector to be batch normalized. For example, some architectures may utilize specialized kernels for complex computations (e.g., a batchnorm kernel) which involve memory intensive operations. Some embodiments may advantageously reduce the number of memory reads needed to complete a batch normalization operation.

For example, some implementations of a batch normalization kernel may require the calculation of the mean (μ) and standard deviation (σ) of a minibatch of N samples x = (x₁, . . . , x_N), where σ is given by:

$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}$  [Eq. 8]

The mean and variance calculation may result in three (3) memory operations including (1) reading all the current minibatch data (x₁, . . . , x_N) to calculate the mean μ; (2) reading each value again to subtract the mean from each of these values (x_i − μ) and writing the calculations back; and (3) reading these values (x_i − μ), calculating the sum of squares using the high precision calculation (48-bit values), and then calculating the standard deviation σ. Some embodiments may calculate the square root of the sum of squares, namely y_(1,N), as the batch samples arrive. For example, the three-step operation may be utilized because of the overflow risk in calculating y_(1,N) using the traditional approach. Because some embodiments may calculate y_(1,N) as the batch samples arrive, only the first memory operation would be needed. The variance can then be calculated as:

$\begin{aligned}\sigma^2 &= E[x^2] - \left(E[x]\right)^2 = E[x^2] - \mu^2 \\ &= \left(\sqrt{E[x^2]} - \mu\right)\left(\sqrt{E[x^2]} + \mu\right) \\ &= \left(\sqrt{\tfrac{1}{N}\sum_{i=1}^{N} x_i^2} - \mu\right)\left(\sqrt{\tfrac{1}{N}\sum_{i=1}^{N} x_i^2} + \mu\right) \\ &= \left(\frac{y_{(1,N)}}{\sqrt{N}} - \mu\right)\left(\frac{y_{(1,N)}}{\sqrt{N}} + \mu\right)\end{aligned}$  [Eq. 9]

For example, because y_(1,N) may be calculated in an overflow-free way as the batch samples arrive, σ² may be computed at the end of the first memory read operation, resulting in a significant performance improvement.
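
A compact single-pass sketch of that idea in Python is shown below; the routine, its name, and the chunking constants are illustrative assumptions layered on Eqs. 7 to 9 (each squared sample is assumed to fit under the threshold), not the kernel itself:

    # Sketch only: one pass accumulates the mean and y_(1,N); Eq. 9 then gives the variance.
    import math

    def streaming_mean_and_variance(samples, a_max):
        n = len(samples)
        threshold = a_max // 2
        total = 0
        y = 0.0            # running y_(1,i), grown chunk by chunk (Eq. 7)
        acc = 0            # current chunk's sum of squares
        for v in samples:
            total += v
            sq = v * v
            if acc + sq > threshold or (y > 0.0 and math.sqrt(acc) >= 0.2 * y):
                # Close the chunk before it can overflow (or once it is large
                # enough relative to the running value) and fold it in.
                y = math.sqrt(acc) if y == 0.0 else y + acc / (2.0 * y)
                acc = 0
            acc += sq
        y = math.sqrt(acc) if y == 0.0 else y + acc / (2.0 * y)   # final partial chunk
        mu = total / n
        var = (y / math.sqrt(n) - mu) * (y / math.sqrt(n) + mu)   # Eq. 9
        return mu, var

For short inputs whose total sum of squares stays below the threshold, the routine reduces to the direct population-variance computation; for long minibatches, the chunked Eq. 7 updates keep every intermediate sum of squares bounded while the data is read only once.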

FIG. 6A shows a machine learning apparatus 132 (132 a-132 b) that may implement one or more aspects of the method 30 (FIG. 3) and/or the method 50 (FIG. 5). The machine learning apparatus 132, which may include logic instructions, configurable logic, and/or fixed-functionality hardware logic, may be readily substituted for the apparatus 10 (FIG. 1) and/or the apparatus 40 (FIG. 4), already discussed. A vector processor 132 a may include technology to process one or more vectors, and a fixed point approximator 132 b may include technology to determine a fixed-point approximation for a sum of squares operation on the one or more vectors (e.g., as a layer of a multi-layer neural network). For example, the fixed point approximator 132 b may be further configured to provide overflow protection for the sum of squares operation. In some embodiments, the fixed point approximator 132 b may be further configured to provide batch normalization for the one or more vectors. For example, the fixed point approximator 132 b may be configured to accumulate a running value corresponding to a square root of the sum of squares operation. Some embodiments of the fixed point approximator 132 b may be further configured to determine a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value, and accumulate the running value corresponding to a square root of the sum of squares operation based on the determined number of elements. In some embodiments, the machine learning apparatus 132 may include further technology to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, a machine learning application, etc.

Turning now to FIG. 6B, a machine learning apparatus 134 (134 a, 134 b) is shown in which logic 134 b (e.g., transistor array and other integrated circuit/IC components) is coupled to a substrate 134 a (e.g., silicon, sapphire, gallium arsenide). The logic 134 b may generally implement one or more aspects of the method 30 (FIG. 3) and/or the method 50 (FIG. 5). Thus, the logic 134 b may process one or more vectors with a sum of squares operation with a layer of a multi-layer neural network, and determine a fixed-point approximation for the sum of squares operation. For example, the logic 134 b may be further configured to provide overflow protection for the sum of squares operation. In some embodiments, the logic 134 b may be further configured to provide batch normalization for the one or more vectors. For example, the logic 134 b may be configured to accumulate a running value corresponding to a square root of the sum of squares operation. Some embodiments of the logic 134 b may be further configured to determine a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value, and accumulate the running value corresponding to a square root of the sum of squares operation based on the determined number of elements. In some embodiments, the logic 134 b may include further technology to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, a machine learning application, etc. In one example, the apparatus 134 is a semiconductor die, chip and/or package.

FIG. 7 illustrates a processor core 200 according to one embodiment. The processor core 200 may be the core for any type of processor, such as a micro-processor, an embedded processor, a digital signal processor (DSP), a network processor, or other device to execute code. Although only one processor core 200 is illustrated in FIG. 7, a processing element may alternatively include more than one of the processor core 200 illustrated in FIG. 7. The processor core 200 may be a single-threaded core or, for at least one embodiment, the processor core 200 may be multithreaded in that it may include more than one hardware thread context (or “logical processor”) per core.

FIG. 7 also illustrates a memory 270 coupled to the processor core 200. The memory 270 may be any of a wide variety of memories (including various layers of memory hierarchy) as are known or otherwise available to those of skill in the art. The memory 270 may include one or more code 213 instruction(s) to be executed by the processor core 200, wherein the code 213 may implement one or more aspects of the method 30 (FIG. 3) and/or the method 50 (FIG. 5), already discussed. The processor core 200 follows a program sequence of instructions indicated by the code 213. Each instruction may enter a front end portion 210 and be processed by one or more decoders 220. The decoder 220 may generate as its output a micro operation such as a fixed width micro operation in a predefined format, or may generate other instructions, microinstructions, or control signals which reflect the original code instruction. The illustrated front end portion 210 also includes register renaming logic 225 and scheduling logic 230, which generally allocate resources and queue the operation corresponding to the code instruction for execution.

The processor core 200 is shown including execution logic 250 having a set of execution units 255-1 through 255-N. Some embodiments may include a number of execution units dedicated to specific functions or sets of functions. Other embodiments may include only one execution unit or one execution unit that can perform a particular function. The illustrated execution logic 250 performs the operations specified by code instructions.

After completion of execution of the operations specified by the code instructions, back end logic 260 retires the instructions of the code 213. In one embodiment, the processor core 200 allows out of order execution but requires in order retirement of instructions. Retirement logic 265 may take a variety of forms as known to those of skill in the art (e.g., re-order buffers or the like). In this manner, the processor core 200 is transformed during execution of the code 213, at least in terms of the output generated by the decoder, the hardware registers and tables utilized by the register renaming logic 225, and any registers (not shown) modified by the execution logic 250.

Although not illustrated in FIG. 7, a processing element may include other elements on chip with the processor core 200. For example, a processing element may include memory control logic along with the processor core 200. The processing element may include I/O control logic and/or may include I/O control logic integrated with memory control logic. The processing element may also include one or more caches.

Referring now to FIG. 8, shown is a block diagram of a system 1000 in accordance with an embodiment. Shown in FIG. 8 is a multiprocessor system 1000 that includes a first processing element 1070 and a second processing element 1080. While two processing elements 1070 and 1080 are shown, it is to be understood that an embodiment of the system 1000 may also include only one such processing element.

The system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the interconnects illustrated in FIG. 8 may be implemented as a multi-drop bus rather than point-to-point interconnect.

As shown in FIG. 8, each of processing elements 1070 and 1080 may be multicore processors, including first and second processor cores (i.e., processor cores 1074 a and 1074 b and processor cores 1084 a and 1084 b). Such cores 1074 a, 1074 b, 1084 a, 1084 b may be configured to execute instruction code in a manner similar to that discussed above in connection with FIG. 7.

Each processing element 1070, 1080 may include at least one shared cache 1896 a, 1896 b (e.g., static random access memory/SRAM). The shared cache 1896 a, 1896 b may store data (e.g., objects, instructions) that are utilized by one or more components of the processor, such as the cores 1074 a, 1074 b and 1084 a, 1084 b, respectively. For example, the shared cache 1896 a, 1896 b may locally cache data stored in a memory 1032, 1034 for faster access by components of the processor. In one or more embodiments, the shared cache 1896 a, 1896 b may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof.

While shown with only two processing elements 1070, 1080, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor. Alternatively, one or more of the processing elements 1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array. For example, additional processing element(s) may include additional processor(s) that are the same as a first processor 1070, additional processor(s) that are heterogeneous or asymmetric to the first processor 1070, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processing element. There can be a variety of differences between the processing elements 1070, 1080 in terms of a spectrum of metrics of merit including architectural, microarchitectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processing elements 1070, 1080. For at least one embodiment, the various processing elements 1070, 1080 may reside in the same die package.

The first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078. Similarly, the second processing element 1080 may include an MC 1082 and P-P interfaces 1086 and 1088. As shown in FIG. 8, MCs 1072 and 1082 couple the processors to respective memories, namely a memory 1032 and a memory 1034, which may be portions of main memory locally attached to the respective processors. While the MCs 1072 and 1082 are illustrated as integrated into the processing elements 1070, 1080, for alternative embodiments the MC logic may be discrete logic outside the processing elements 1070, 1080 rather than integrated therein.

The first processing element 1070 and the second processing element 1080 may be coupled to an I/O subsystem 1090 via P-P interconnects 1076, 1086, respectively. As shown in FIG. 8, the I/O subsystem 1090 includes a TEE 1097 (e.g., security controller) and P-P interfaces 1094 and 1098. Furthermore, the I/O subsystem 1090 includes an interface 1092 to couple the I/O subsystem 1090 with a high performance graphics engine 1038. In one embodiment, a bus 1049 may be used to couple the graphics engine 1038 to the I/O subsystem 1090. Alternately, a point-to-point interconnect may couple these components.

In turn, the I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096. In one embodiment, the first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited.

As shown in FIG. 8, various I/O devices 1014 (e.g., cameras, sensors) may be coupled to the first bus 1016, along with a bus bridge 1018 which may couple the first bus 1016 to a second bus 1020. In one embodiment, the second bus 1020 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 1020 including, for example, a keyboard/mouse 1012, network controllers/communication device(s) 1026 (which may in turn be in communication with a computer network), and a data storage unit 1019 such as a disk drive or other mass storage device which may include code 1030, in one embodiment. The code 1030 may include instructions for performing embodiments of one or more of the methods described above. Thus, the illustrated code 1030 may implement one or more aspects of the method 30 (FIG. 3) and/or the method 50 (FIG. 5), already discussed, and may be similar to the code 213 (FIG. 7), already discussed. Further, an audio I/O 1024 may be coupled to the second bus 1020.

Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of FIG. 8, a system may implement a multi-drop bus or another such communication topology.

ADDITIONAL NOTES AND EXAMPLES

Example 1 may include a multi-layer neural network apparatus, comprising a first computational layer, and a second computational layer communicatively coupled to the first computational layer, wherein one or more of the first and second computational layers include logic to process one or more vectors with a sum of squares operation, and determine a fixed-point approximation for the sum of squares operation.

Example 2 may include the apparatus of Example 1, wherein the logic is further to provide overflow protection for the sum of squares operation.

Example 3 may include the apparatus of any of Examples 1 to 2, wherein the logic is further to provide batch normalization for the one or more vectors.

Example 4 may include the apparatus of any of Examples 1 to 3, wherein the logic is further to accumulate a running value corresponding to a square root of the sum of squares operation.

Example 5 may include the apparatus of Example 4, wherein the logic is further to determine a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value, and accumulate the running value corresponding to a square root of the sum of squares operation based on the determined number of elements.

Example 6 may include the apparatus of any of Examples 1 to 5, wherein the logic is further to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, and a machine learning application.

Example 7 may include a semiconductor package apparatus, comprising one or more substrates, and logic coupled to the one or more substrates, wherein the logic is at least partly implemented in one or more of configurable logic and fixed-functionality hardware logic, the logic coupled to the one or more substrates to process one or more vectors with a sum of squares operation with a layer of a multi-layer neural network, and determine a fixed-point approximation for the sum of squares operation.

Example 8 may include the apparatus of Example 7, wherein the logic is further to provide overflow protection for the sum of squares operation.

Example 9 may include the apparatus of any of Examples 7 to 8, wherein the logic is further to provide batch normalization for the one or more vectors.

Example 10 may include the apparatus of any of Examples 7 to 9, wherein the logic is further to accumulate a running value corresponding to a square root of the sum of squares operation.

Example 11 may include the apparatus of Example 10, wherein the logic is further to determine a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value, and accumulate the running value corresponding to a square root of the sum of squares operation based on the determined number of elements.

Example 12 may include the apparatus of any of Examples 7 to 11, wherein the multi-layer neural network is further to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, and a machine learning application.

Example 13 may include the apparatus of any of Examples 7 to 12, wherein the logic coupled to the one or more substrates includes transistor channel regions that are positioned within the one or more substrates.

Example 14 may include a method of machine learning, comprising processing one or more vectors with a sum of squares operation with a layer of a multi-layer neural network, and determining a fixed-point approximation for the sum of squares operation.

Example 15 may include the method of Example 14, further comprising providing overflow protection for the sum of squares operation.

Example 16 may include the method of any of Examples 14 to 15, further comprising providing batch normalization for the one or more vectors.

Example 17 may include the method of any of Examples 14 to 16, further comprising accumulating a running value corresponding to a square root of the sum of squares operation.

Example 18 may include the method of Example 17, further comprising determining a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value, and accumulating the running value corresponding to a square root of the sum of squares operation based on the determined number of elements.

Example 19 may include the method of any of Examples 14 to 18, wherein the multi-layer neural network is further to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, and a machine learning application.

Example 20 may include at least one computer readable storage medium, comprising a set of instructions, which when executed by a computing device, cause the computing device to process one or more vectors with a sum of squares operation with a layer of a multi-layer neural network, and determine a fixed-point approximation for the sum of squares operation.

Example 21 may include the at least one computer readable storage medium of Example 20, comprising a further set of instructions, which when executed by the computing device, cause the computing device to provide overflow protection for the sum of squares operation.

Example 22 may include the at least one computer readable storage medium of any of Examples 20 to 21, comprising a further set of instructions, which when executed by the computing device, cause the computing device to provide batch normalization for the one or more vectors.

Example 23 may include the at least one computer readable storage medium of any of Examples 20 to 22, comprising a further set of instructions, which when executed by the computing device, cause the computing device to accumulate a running value corresponding to a square root of the sum of squares operation.

Example 24 may include the at least one computer readable storage medium of Example 23, comprising a further set of instructions, which when executed by the computing device, cause the computing device to determine a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value, and accumulate the running value corresponding to a square root of the sum of squares operation based on the determined number of elements.

Example 25 may include the at least one computer readable storage medium of any of Examples 20 to 24, wherein the multi-layer neural network is further to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, and a machine learning application.

Example 26 may include a machine learning apparatus, comprising means for processing one or more vectors with a sum of squares operation with a layer of a multi-layer neural network, and means for determining a fixed-point approximation for the sum of squares operation.

Example 27 may include the apparatus of Example 26, further comprising means for providing overflow protection for the sum of squares operation.

Example 28 may include the apparatus of any of Examples 26 to 27, further comprising means for providing batch normalization for the one or more vectors.

Example 29 may include the apparatus of any of Examples 26 to 28, further comprising means for accumulating a running value corresponding to a square root of the sum of squares operation.

Example 30 may include the apparatus of Example 29, further comprising means for determining a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value, and means for accumulating the running value corresponding to a square root of the sum of squares operation based on the determined number of elements.

Example 31 may include the apparatus of any of Examples 26 to 30, wherein the multi-layer neural network is further to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, and a machine learning application.

Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.

Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the platform within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.

The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.

As used in this application and in the claims, a list of items joined by the term “one or more of” may mean any combination of the listed terms. For example, the phrase “one or more of A, B, and C” and the phrase “one or more of A, B, or C” both may mean A; B; C; A and B; A and C; B and C; or A, B and C.

Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.

We claim:
1. A multi-layer neural network apparatus, comprising: a first computational layer; and a second computational layer communicatively coupled to the first computational layer, wherein one or more of the first and second computational layers include logic to: process one or more vectors with a sum of squares operation, and determine a fixed-point approximation for the sum of squares operation.
2. The apparatus of claim 1, wherein the logic is further to: provide overflow protection for the sum of squares operation.
3. The apparatus of claim 1, wherein the logic is further to: provide batch normalization for the one or more vectors.
4. The apparatus of claim 1, wherein the logic is further to: accumulate a running value corresponding to a square root of the sum of squares operation.
5. The apparatus of claim 4, wherein the logic is further to: determine a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value; and accumulate the running value corresponding to a square root of the sum of squares operation based on the determined number of elements.
6. The apparatus of claim 1, wherein the logic is further to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, and a machine learning application.
7. A semiconductor package apparatus, comprising: one or more substrates; and logic coupled to the one or more substrates, wherein the logic is at least partly implemented in one or more of configurable logic and fixed-functionality hardware logic, the logic coupled to the one or more substrates to: process one or more vectors with a sum of squares operation with a layer of a multi-layer neural network, and determine a fixed-point approximation for the sum of squares operation.
8. The apparatus of claim 7, wherein the logic is further to: provide overflow protection for the sum of squares operation.
9. The apparatus of claim 7, wherein the logic is further to: provide batch normalization for the one or more vectors.
10. The apparatus of claim 7, wherein the logic is further to: accumulate a running value corresponding to a square root of the sum of squares operation.
11. The apparatus of claim 10, wherein the logic is further to: determine a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value; and accumulate the running value corresponding to a square root of the sum of squares operation based on the determined number of elements.
12. The apparatus of claim 7, wherein the multi-layer neural network is further to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, and a machine learning application.
13. The apparatus of claim 7, wherein the logic coupled to the one or more substrates includes transistor channel regions that are positioned within the one or more substrates.
14. A method of machine learning, comprising: processing one or more vectors with a sum of squares operation with a layer of a multi-layer neural network; and determining a fixed-point approximation for the sum of squares operation.
15. The method of claim 14, further comprising: providing overflow protection for the sum of squares operation.
16. The method of claim 14, further comprising: providing batch normalization for the one or more vectors.
17. The method of claim 14, further comprising: accumulating a running value corresponding to a square root of the sum of squares operation.
18. The method of claim 17, further comprising: determining a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value; and accumulating the running value corresponding to a square root of the sum of squares operation based on the determined number of elements.
19. The method of claim 14, wherein the multi-layer neural network is further to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, and a machine learning application.
20. At least one computer readable storage medium, comprising a set of instructions, which when executed by a computing device, cause the computing device to: process one or more vectors with a sum of squares operation with a layer of a multi-layer neural network; and determine a fixed-point approximation for the sum of squares operation.
21. The at least one computer readable storage medium of claim 20, comprising a further set of instructions, which when executed by the computing device, cause the computing device to: provide overflow protection for the sum of squares operation.
22. The at least one computer readable storage medium of claim 20, comprising a further set of instructions, which when executed by the computing device, cause the computing device to: provide batch normalization for the one or more vectors.
23. The at least one computer readable storage medium of claim 20, comprising a further set of instructions, which when executed by the computing device, cause the computing device to: accumulate a running value corresponding to a square root of the sum of squares operation.
24. The at least one computer readable storage medium of claim 23, comprising a further set of instructions, which when executed by the computing device, cause the computing device to: determine a number of elements for the sum of squares operation based on a threshold value relative to a maximum fixed-point value; and accumulate the running value corresponding to a square root of the sum of squares operation based on the determined number of elements.
25. The at least one computer readable storage medium of claim 20, wherein the multi-layer neural network is further to provide one or more of supervised and unsupervised learning for one or more of a speech processing application, an image processing application, a pattern processing application, and a machine learning application.